A Compressed Text Index on Secondary Memory

نویسندگان

  • Rodrigo González
  • Gonzalo Navarro
چکیده

We introduce a practical disk-based compressed text index that, when the text is compressible, takes much less space than the suffix array. It provides good I/O times for searching, which in particular improve when the text is compressible. In this aspect our index is unique, as most compressed indexes are slower than their classical counterparts on secondary memory. We analyze our index and show experimentally that it is extremely competitive on compressible texts. As side contributions, we introduce a compressed rank dictionary for secondary memory operating in one I/O access, as well as a simple encoding of sequences that achieves high-order compression and provides constant-time random access, both in main and secondary memory.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Space-efficient construction of Lempel-Ziv compressed text indexes

A compressed full-text self-index is a data structure that replaces a text and in addition gives indexed access to it, while taking space proportional to the compressed text size. This is very important nowadays, since one can accommodate the index of very large texts entirely in main memory, avoiding the slower access to secondary storage. In particular, the LZ-index [G. Navarro, Journal of Di...

متن کامل

Distributed search based on self-indexed compressed text

Query response times within a fraction of a second in Web search engines are feasible due to the use of indexing and caching techniques, which are devised for large text collections partitioned and replicated into a set of distributed memory processors. This paper proposes an alternative query processing method for this setting, which is based on a combination of self-indexed compressed text an...

متن کامل

Parallel and Distributed Compressed Indexes

We study parallel and distributed compressed indexes. Compressed indexes are a new and functional way to index text strings. They exploit the compressibility of the text, so that their size is a function of the compressed text size. Moreover, they support a considerable amount of functions, more than many classical indexes. We make use of this extended functionality to obtain, in a shared-memor...

متن کامل

Compressed Text Indexes with Fast Locate

Compressed text (self-)indexes have matured up to a point where they can replace a text by a data structure that requires less space and, in addition to giving access to arbitrary text passages, support indexed text searches. At this point those indexes are competitive with traditional text indexes (which are very large) for counting the number of occurrences of a pattern in the text. Yet, they...

متن کامل

CHICO: A Compressed Hybrid Index for Repetitive Collections

Indexing text collections to support pattern matching queries is a fundamental problem in computer science. New challenges keep arising as databases grow, and for repetitive collections, compressed indexes become relevant. To successfully exploit the regularities of repetitive collections different approaches have been proposed. Some of these are Compressed Suffix Array, Lempel-Ziv, and Grammar...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007